Names: Leon Kwan, Doron lisiansky, Richard Clarke github repo: https://github.com/r-clarke/bioinformatics https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119290

if (!requireNamespace("knitr", quietly = TRUE))
    install.packages("knitr")
if(!("org.Hs.eg.db" %in% installed.packages()))
  BiocManager::install('org.Hs.eg.db', update = FALSE)
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!("DESeq2" %in% installed.packages())) {
  BiocManager::install("DESeq2", update = FALSE)
}
if (!("EnhancedVolcano" %in% installed.packages())) {
  BiocManager::install("EnhancedVolcano", update = FALSE)
}
if (!("apeglm" %in% installed.packages())) {
  BiocManager::install("apeglm", update = FALSE)
}
if (!("pheatmap" %in% installed.packages())) {
  BiocManager::install("pheatmap", update = FALSE)
}
if (!("gprofiler2" %in% installed.packages())) {
  BiocManager::install("gprofiler2", update = FALSE)
}
if (!("clusterProfiler" %in% installed.packages())) {
  BiocManager::install("clusterProfiler", update = FALSE)
}
if (!("M3C" %in% installed.packages())) {
  BiocManager::install("M3C", update = FALSE)
}
library(M3C)
library(umap)

Attaching package: ‘umap’

The following object is masked from ‘package:M3C’:

    umap
library(Rtsne)
library(clusterProfiler)
clusterProfiler v4.0.5  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141. doi: 10.1016/j.xinn.2021.100141

Attaching package: ‘clusterProfiler’

The following object is masked from ‘package:stats’:

    filter
library(gprofiler2)
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
library ("pheatmap")
library ("RColorBrewer")
library(magrittr)
library(matrixStats)
library(ggplot2)
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply,
    parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated,
    eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
    rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min


Attaching package: ‘S4Vectors’

The following object is masked from ‘package:clusterProfiler’:

    rename

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges

Attaching package: ‘IRanges’

The following object is masked from ‘package:clusterProfiler’:

    slice

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics

Attaching package: ‘MatrixGenerics’

The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins,
    colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads,
    colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges,
    colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs,
    rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs,
    rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads,
    rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor,
    see 'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: ‘Biobase’

The following object is masked from ‘package:MatrixGenerics’:

    rowMedians

The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians
library(DOSE)
library('org.Hs.eg.db')
Loading required package: AnnotationDbi

Attaching package: ‘AnnotationDbi’

The following object is masked from ‘package:clusterProfiler’:

    select
library(data.table)
data.table 1.14.2 using 1 threads (see ?getDTthreads).  Latest news: r-datatable.com
**********
This installation of data.table has not detected OpenMP support. It should still work but in single-threaded mode.
This is a Mac. Please read https://mac.r-project.org/openmp/. Please engage with Apple and ask them for support. Check r-datatable.com for updates, and our Mac instructions here: https://github.com/Rdatatable/data.table/wiki/Installation. After several years of many reports of installation problems on Mac, it's time to gingerly point out that there have been no similar problems on Windows or Linux.
**********

Attaching package: ‘data.table’

The following object is masked from ‘package:SummarizedExperiment’:

    shift

The following object is masked from ‘package:GenomicRanges’:

    shift

The following object is masked from ‘package:IRanges’:

    shift

The following objects are masked from ‘package:S4Vectors’:

    first, second
###### Load the data into R.######
matrix_of_data <- as.matrix(read.table("GSE119290_Readhead_2018_RNAseq_gene_counts.txt"))
coldata<-read.csv("coldata.csv",header = T,row.names=1,stringsAsFactors=T)
##### calculate per-gene expression ranges and generating a density plot #####
matrix_of_ranges <- rowRanges(matrix_of_data, rows = NULL, cols = NULL, na.rm = FALSE, dim. = dim(matrix_of_data), useNames = NA)

vector_of_ranges <- 0
for(i in 1:26364) {
  vector_of_ranges[i] <- matrix_of_ranges[i,2]-matrix_of_ranges[i,1]
}
vector_of_ranges
   [1]     41   1767    132    132    132    132      5      5      5      5      3      3      1   1159   3270
  [16]   1598   1598     22     22     22      6   6795    780     39    124   1111     28    102   2021   7450
  [31]    888     32     15   1169   1133  98732      4   1028     41      1      1      1     14     77      6
  [46]  10563   4508     43   3752    167   4599      8   1832   4976     51   1677     99   4426     40   1061
  [61]   7033   5415   1625   5290    129     15     11   1244    367   3203   5804    274   5384   1102   3048
  [76]    307    597   4254  10132   2897   7032   3740  49505     12     32     28     10   1705   3949  13540
  [91]    706    209   6716   3721     53   2980   2161     28    117     13   2897    634     10      2    207
 [106]   2181      1    534    969      3    923    833    525   3940     10     53   3687   1822    561   1697
 [121]     25      2      0    139      0      3   1687    359    119  17325     23    188   8122     21    388
 [136]   3736   7508    374    378      1   1179   3495   3052      7   1133   3149   3152   2788   3824      2
 [151]   2872   3570    599     14     35  13253  21306      2    900  24204     20 157014    124     22      2
 [166]      2     10     50      8   6421    500    231    910   4853   1140      1     72  21989   2971   1317
 [181]    617      4     25   4932  39241     95  19906    902    897    210   2739   1447    878     15  14160
 [196]    275   8176   5336   8058    176     35   1376   2751      2    222   1121    268   7510   1895   1698
 [211]    357      2   1425   5122   1131    366    794   2488  10403   7352   2488     17     13      1   1447
 [226]      1   3904     28     28     79      2      1      1      7      4      3      2      8      8      8
 [241]      3      2      3      3      2      1      3      1      1      1      1      2      7      7      6
 [256]      2      2      3      5      5      0      1      2      2      5  11017   4457   3006    799   1200
 [271]      5     81   2735      9      3     40    532   1530     78    775    641   5069    185      3   6697
 [286]    211    582  12264      9   3009      6      2      5     18     32  21303    539    565   2331  11170
 [301]      9   2800   1331      1      0   7928   2279     69      2      2      5     68   3412   2901   3994
 [316]   2805    183     12      9      0      5      3  18070   4617    567      1     26    740   1813      2
 [331]   3029      4      5   4281  17509   8807   8857     95    151    707   2586   1738  13335   5025   2535
 [346]     49   1127      3     84      0   2144      1     11      3      6      3      2     31    238     39
 [361]      8   5195   2223     15      3   3255     72   3644  10838   1015   2591  15594   9933      5  17752
 [376]     18    536   1613   1066   4797   2635  16293      1      1     40    570  14671   3622      0   3041
 [391]     17      0      1      1      1  15623      1    518      5      6  12259     10      6   4128    129
 [406]    331  23559   8387   1018     63   4543   2741    100  22553      3  29344   4179    935   3462   4903
 [421]   1563   1042    504      4      0   7498  12001      4      6      9      4     58    342    815      1
 [436]     90     26      8   8991  21268     16      0      0   1639   1339     59   4163     79   1386    724
 [451]   5571  21117   2896   1106    843   1141  41159   2450    777     97      3      3   1144     65   1962
 [466]    187      2      4   2725   7751   1651      3     86      4   4013   1490     41  38661   2325      2
 [481]  18825    450   2091     64   1570    851      3  10816     14   1205   1767   1479   4063   2372    999
 [496]     60   1435     24      6    332   8754   3491      2    916    451   3444   3776      9    155   3773
 [511]     15    255   2095     25   6935   4965   4266   1116   6348   4278   8453     61    829    887     45
 [526]    190     45     58     61    727     13   1005   5527     58  12478   1535   8136   1081   4099      2
 [541]      2     17     13     38      0  58965   8804     48     48      8   4623   4205   2561    796   4389
 [556]      1      4     59      8   4827    268  27561      9   1472   9841  13416   1426      9   7058   5237
 [571]   1884    435    119    445  11258    212     29      4   6424  58752    133    214   2859    406   1225
 [586]   1072  20138   4261   1299   7011   2394   1175    416   3532   1964   3419    223    890   4260     18
 [601]  12015    657     21    336    795      2      3     40      0     22      8     42     19   2642    356
 [616]     76    375    863   1679  34128   7909   5890   4032     92  13020   1311   6221   2357  10820   1650
 [631]    531   1944     27   2058  11474  12352    206   3390   9894   2738    775   2917      3   1165      0
 [646]     60    169      4   2908     28   1154   3216   3662     42   1609   5663     42   3752   1483   5248
 [661]   1703   1228   5631   1294   2270   1357      6     17   1455   1736   1681     58     30   8027  17851
 [676]  51909  10255    796    307   5994   1449     22    263      6   1516   3173    215     36   1138   1712
 [691]   3398  26685   6738   2568      3   3967    887   2473    480    611    620    300   3409    666   2968
 [706]      2      1     95    207   7894    491    138   2957   1590     38   6122      1      1   3461    878
 [721]     13    705    167   2014  61381    228   2941    467     53   1262   1054    461     15  27625     87
 [736]    127   6476     20    183      5      9    783     12  13014   4541     16   1813   3820      4    838
 [751]  35040   5807    630   1535     12      4    115   2705   2215   5474   6074    413  24280     27   2039
 [766]   4712     80   9712      7    872     42   8123  41389     59     42     40     58     16   1217     76
 [781]    501    771   2002   2805   2923   1185     73    130   1854   1299    114    214    674  23224   5373
 [796]  18910     93   4042     36   1326   1226   5819   8565      6     23   6757    363   2342   6000   8649
 [811]   1486    251      2      1     13    133      3    687    363   2050      7    230   6918      3      0
 [826]      2     26      2      2      1     17    783   3027   5010     20      6     29     13    128      7
 [841]    246   2595     18    751      1      0    850     69   2554   2882      0      0      0   4508     35
 [856]      1   3802   4418   6728     47    498   3875    711     15   4362   2074   1724   2482   4247   4964
 [871]    311     18   1248   4877    206    160   3938      4      5      2    752     34   1137   2592   2660
 [886]    634  14466     19      1      1     10   5272   1454     27   1644   2090      1   4392    843      1
 [901]      7    629   4351  10736    275     60     22      2   1199   1246    486      4      8  38718      3
 [916]      1   1576   7683     76      0  11118   1959     22      2      6   2343      5    291    187   2998
 [931]  24724      6     25      7   1212      2     12    291     25      1    101     14   6835      7      3
 [946]    607   1816      1      1     11    105   7893   8324     24    923      3     99     83      0    602
 [961]   1349    750     82   3090   1334      9      1   8227      2   2002   3823     11     11      1   5001
 [976]   1142   4180    851   8345      8    194      1     36      7    419   2152   1692     11      2     23
 [991]  29496   3526  15995    355   4054  37225      7   2697   2902     38
 [ reached getOption("max.print") -- omitted 25364 entries ]
plot(density(vector_of_ranges), log='x')
Warning in xy.coords(x, y, xlabel, ylabel, log) :
  1 x value <= 0 omitted from logarithmic plot

The size of the expression matrix generated is 26,364 x 44. There are 26,364 different genes with 44 samples. Looking at the variation of counts, there are a few instances of very large ranges with majority of ranges being smaller. The density of gene variation follows an expected poisson distribution with lambda 1.There are some genes with high expression ranges, but most genes are not as variable.

##### generate a PCA plot #######
dds <- DESeqDataSetFromMatrix(countData = matrix_of_data,colData = coldata,design = ~ dex)
dds <- dds[rowSums(counts(dds)) > 1,]

vst <- vst(dds,blind = FALSE)
plotPCA(vst, intgroup=c("dex"))

##### generate a UMAP plot #######
matrix_of_data_t <- t(matrix_of_data)
df.umap = umap(matrix_of_data_t)

dimension1 <- range(df.umap$layout)
dimension2 <- dimension1
plot(dimension1, dimension2, type="n")
points(df.umap$layout[,1], df.umap$layout[,2], col=as.integer(coldata[,"dex"])+2, cex=2, pch=20)
labels.u = unique(coldata[,"dex"])
legend.text = as.character(labels.u)
legend.pos = "bottomleft"
legend.text = paste(as.character(labels.u), "")
 
legend(legend.pos, legend=legend.text, inset=0.03,col=as.integer(labels.u)+2,bty="n", pch=20, cex=1)

Neither UMAP nor PCA, show clear clustering. However, it should be noted that our dataset may not follow the assumptions necessary for dimensionality reduction for both PCA and UMAP. PCA assumes linearity in the dataset and UMAP assumes uniform distribution on a Reimann manifold.

#### Perform differential analysis on the samples from your two groups #####

deseq_object <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 308 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
deseq_results <- results(deseq_object)
deseq_results <- lfcShrink(deseq_object,coef = 2, res = deseq_results )
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
    Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    sequence count data: removing the noise and preserving large differences.
    Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
head(deseq_results)
log2 fold change (MAP): dex SZ vs Control 
Wald test p-value: dex SZ vs Control 
DataFrame with 6 rows and 5 columns
           baseMean log2FoldChange     lfcSE    pvalue      padj
          <numeric>      <numeric> <numeric> <numeric> <numeric>
DDX11L1     15.2075      0.0185428  0.144337  0.785012  0.891035
WASH7P    1157.3822      0.1255759  0.110817  0.140908  0.348768
MIR6859-3   84.1270      0.1824412  0.131918  0.057109  0.200511
MIR6859-2   84.1270      0.1824412  0.131918  0.057109  0.200511
MIR6859-4   84.1270      0.1824412  0.131918  0.057109  0.200511
MIR6859-1   84.1270      0.1824412  0.131918  0.057109  0.200511
deseq_df <- deseq_results %>%
  # make into data.frame
  as.data.frame() %>%
  # the gene names are row names -- let's make them a column for easy display
  tibble::rownames_to_column("Gene") %>%
  # add a column for significance threshold results
  dplyr::mutate(threshold = padj < 0.05) %>%
  # sort by statistic -- the highest values will be genes with
  # higher expression
  dplyr::arrange(dplyr::desc(log2FoldChange))
head(deseq_df)
plotCounts(dds, gene = "DDX11L1", intgroup = "dex")

We are performing differential expression analysis. In our DESeq2 object we designated our dex variable which represent the group as the model argument. Because of this, the DESeq function will use groups defined by dex to test for differential expression. We will use lfcShrink in order to decrease noise and preserve large differences between groups. We can see for example that The DDX11L1 gene samples have similar expression in both groups.

Volcano plot

#volcano plot here
volcano_plot <- EnhancedVolcano::EnhancedVolcano(
  deseq_df,lab = deseq_df$Gene,x = "log2FoldChange",
  y = "padj",pCutoff = 0.01 )
Warning: Ignoring unknown parameters: xlim, ylim
volcano_plot

Summary of volcano plot

in volcano plot we are presenting analysis between two groups, which presents the log fold-change vs. the log10 p-value.

Heatmap

#heatmap
vst <- vst(dds,blind = FALSE)

sampleDists <- dist(t(assay(vst)))
sampleDists
             Sample_LI-01 Sample_LI-02 Sample_LI-03 Sample_LI-04 Sample_LI-05 Sample_LI-06 Sample_LI-07
Sample_LI-02    144.42095                                                                              
Sample_LI-03     96.21431    133.97114                                                                 
Sample_LI-04     99.87450    139.71166     37.77497                                                    
Sample_LI-05     83.60813    156.00739    101.78338    103.29480                                       
Sample_LI-06    119.44816    116.22757    120.67800    123.34283     96.11964                          
Sample_LI-07     95.37821    155.36706    112.59969    113.70332     76.26127    115.94728             
Sample_LI-08     95.26113    141.14019    109.07132    108.09854     78.42244    101.09101     39.36980
Sample_LI-09     80.44241    130.44555     99.27533    104.59342     58.80801     75.32779     84.06512
Sample_LI-10     84.61049    159.95587    101.84089     97.85090     50.30949    101.46811     77.90681
Sample_LI-11     89.07397    124.77483     80.79587     83.05840     91.87756    124.53992     90.01049
Sample_LI-12     42.10256    153.45828     95.05450     93.44808     80.13373    119.25132     91.88427
Sample_LI-13     93.57608    146.53943     89.62124     92.59267     71.68491    108.62476     78.17900
Sample_LI-14     98.06298    144.25071     93.04808     91.18505     75.18324    100.26206     80.47143
Sample_LI-15     88.27419    145.39448     96.89075     99.89449     72.05981    108.87130     68.31503
Sample_LI-16     91.47799    134.87317     95.56142     94.36748     76.91570     96.16823     75.87451
Sample_LI-17    110.21659    156.31258    108.56673    112.58090     93.57011    121.32108     79.21296
Sample_LI-18    115.32256    165.33256    111.34182    106.54065     96.07806    125.66399     82.89720
Sample_LI-19    117.31859    167.04564    123.81656    126.89681    116.25921    147.07707     92.33292
Sample_LI-20    117.22968    160.07730    119.69261    118.73346    114.31221    136.86663     91.68664
Sample_LI-23    115.82222    159.04777    117.81344    119.68953    106.69378    132.85037     86.20527
Sample_LI-24    117.43344    161.46974    117.82703    114.48324    106.97566    133.33938     86.94077
Sample_LI-25    114.69350    134.34106     78.37575     79.54452    105.88925    117.89541    115.89225
Sample_LI-27    107.69827    150.74338     98.20624    102.74555    101.89783    130.49242    101.04756
             Sample_LI-08 Sample_LI-09 Sample_LI-10 Sample_LI-11 Sample_LI-12 Sample_LI-13 Sample_LI-14
Sample_LI-02                                                                                           
Sample_LI-03                                                                                           
Sample_LI-04                                                                                           
Sample_LI-05                                                                                           
Sample_LI-06                                                                                           
Sample_LI-07                                                                                           
Sample_LI-08                                                                                           
Sample_LI-09     78.70855                                                                              
Sample_LI-10     75.79231     61.25290                                                                 
Sample_LI-11     90.30398     96.56694     92.11518                                                    
Sample_LI-12     91.16393     84.22445     73.28430     86.55308                                       
Sample_LI-13     80.65945     81.27663     78.69382     80.22686     89.71738                          
Sample_LI-14     76.53043     83.31679     75.45347     85.40616     90.37432     37.77026             
Sample_LI-15     72.22288     79.79228     80.26742     82.77390     86.44219     39.10570     46.18250
Sample_LI-16     68.34821     79.38293     78.20575     85.62742     86.45278     48.33529     39.87597
Sample_LI-17     81.19500     97.19180     98.54813    102.53927    108.27463     81.33364     84.02504
Sample_LI-18     79.18897    107.02390     90.33651    103.42611    104.93807     87.07794     80.62000
Sample_LI-19     97.36230    118.90778    119.27992    114.69949    117.15625    106.04336    110.89265
Sample_LI-20     89.17978    115.67221    112.69894    112.69567    113.06653    104.30384    102.93066
Sample_LI-23     87.29910    108.25278    108.02152    108.10184    113.69641     98.01200     99.75891
Sample_LI-24     83.42641    112.45640    102.49730    107.78078    110.80109    100.49780     96.56208
Sample_LI-25    112.34720    106.37446    107.46017     83.09330    111.47755     81.20942     83.53271
Sample_LI-27    103.66152    101.12589    103.96036     95.66504    108.89038     90.49695     96.61505
             Sample_LI-15 Sample_LI-16 Sample_LI-17 Sample_LI-18 Sample_LI-19 Sample_LI-20 Sample_LI-23
Sample_LI-02                                                                                           
Sample_LI-03                                                                                           
Sample_LI-04                                                                                           
Sample_LI-05                                                                                           
Sample_LI-06                                                                                           
Sample_LI-07                                                                                           
Sample_LI-08                                                                                           
Sample_LI-09                                                                                           
Sample_LI-10                                                                                           
Sample_LI-11                                                                                           
Sample_LI-12                                                                                           
Sample_LI-13                                                                                           
Sample_LI-14                                                                                           
Sample_LI-15                                                                                           
Sample_LI-16     40.29908                                                                              
Sample_LI-17     81.23823     85.23606                                                                 
Sample_LI-18     87.91855     82.64446     52.15115                                                    
Sample_LI-19    101.00510    106.81797     76.73758     90.05378                                       
Sample_LI-20     99.72611     98.16905     76.69849     76.56972     43.88827                          
Sample_LI-23     93.77137     96.33758     61.38807     72.16698     58.01886     58.54225             
Sample_LI-24     95.63683     92.77982     71.02220     61.30604     70.22445     56.41068     42.82230
Sample_LI-25     92.85541     90.58481    116.67682    116.47200    136.97377    131.50994    126.77162
Sample_LI-27     95.68051    100.95401    104.69435    111.40654    113.49061    114.35801    111.71844
             Sample_LI-24 Sample_LI-25 Sample_LI-27 Sample_LI-28 Sample_LI-29 Sample_LI-30 Sample_LI-31
Sample_LI-02                                                                                           
Sample_LI-03                                                                                           
Sample_LI-04                                                                                           
Sample_LI-05                                                                                           
Sample_LI-06                                                                                           
Sample_LI-07                                                                                           
Sample_LI-08                                                                                           
Sample_LI-09                                                                                           
Sample_LI-10                                                                                           
Sample_LI-11                                                                                           
Sample_LI-12                                                                                           
Sample_LI-13                                                                                           
Sample_LI-14                                                                                           
Sample_LI-15                                                                                           
Sample_LI-16                                                                                           
Sample_LI-17                                                                                           
Sample_LI-18                                                                                           
Sample_LI-19                                                                                           
Sample_LI-20                                                                                           
Sample_LI-23                                                                                           
Sample_LI-24                                                                                           
Sample_LI-25    125.40196                                                                              
Sample_LI-27    114.67548    101.83543                                                                 
             Sample_LI-32 Sample_LI-33 Sample_LI-34 Sample_LI-36 Sample_LI-37 Sample_LI-38 Sample_LI-39
Sample_LI-02                                                                                           
Sample_LI-03                                                                                           
Sample_LI-04                                                                                           
Sample_LI-05                                                                                           
Sample_LI-06                                                                                           
Sample_LI-07                                                                                           
Sample_LI-08                                                                                           
Sample_LI-09                                                                                           
Sample_LI-10                                                                                           
Sample_LI-11                                                                                           
Sample_LI-12                                                                                           
Sample_LI-13                                                                                           
Sample_LI-14                                                                                           
Sample_LI-15                                                                                           
Sample_LI-16                                                                                           
Sample_LI-17                                                                                           
Sample_LI-18                                                                                           
Sample_LI-19                                                                                           
Sample_LI-20                                                                                           
Sample_LI-23                                                                                           
Sample_LI-24                                                                                           
Sample_LI-25                                                                                           
Sample_LI-27                                                                                           
             Sample_LI-40 Sample_LI-41 Sample_LI-42 Sample_LI-43 Sample_LI-44 Sample_LI-45 Sample_LI-46
Sample_LI-02                                                                                           
Sample_LI-03                                                                                           
Sample_LI-04                                                                                           
Sample_LI-05                                                                                           
Sample_LI-06                                                                                           
Sample_LI-07                                                                                           
Sample_LI-08                                                                                           
Sample_LI-09                                                                                           
Sample_LI-10                                                                                           
Sample_LI-11                                                                                           
Sample_LI-12                                                                                           
Sample_LI-13                                                                                           
Sample_LI-14                                                                                           
Sample_LI-15                                                                                           
Sample_LI-16                                                                                           
Sample_LI-17                                                                                           
Sample_LI-18                                                                                           
Sample_LI-19                                                                                           
Sample_LI-20                                                                                           
Sample_LI-23                                                                                           
Sample_LI-24                                                                                           
Sample_LI-25                                                                                           
Sample_LI-27                                                                                           
             Sample_LI-47
Sample_LI-02             
Sample_LI-03             
Sample_LI-04             
Sample_LI-05             
Sample_LI-06             
Sample_LI-07             
Sample_LI-08             
Sample_LI-09             
Sample_LI-10             
Sample_LI-11             
Sample_LI-12             
Sample_LI-13             
Sample_LI-14             
Sample_LI-15             
Sample_LI-16             
Sample_LI-17             
Sample_LI-18             
Sample_LI-19             
Sample_LI-20             
Sample_LI-23             
Sample_LI-24             
Sample_LI-25             
Sample_LI-27             
 [ reached getOption("max.print") -- omitted 20 rows ]
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( vst$dex, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)

#Heatmap of sample-to-sample distances
pheatmap(sampleDistMatrix,clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,col = colors)


topVarGenes <- head(order(rowVars(assay(vst)), decreasing = TRUE), 20)
topVarGenes
 [1] 23946 23945 24456 15776 24464  1660 10189 24311  1652 24481  6057 24463 22385 15849 24483 10106 10188
[18] 20848 10186 21054
mat  <- assay(vst)[topVarGenes, ]

mat<-mat-rowMeans(mat)
anno <- as.data.frame(colData(vst)[c("dex")])
pheatmap(mat, annotation_col = anno)

Summary of heatmap plot we calculate the distance matrix and apply clustering , In the sample distance heatmap made previously, the dendrogram at the side shows us a hierarchical clustering of the samples. Such a clustering can also be performed for the genes. Since the clustering is only relevant for genes that actually carry a signal, one usually would only cluster a subset of the most highly variable genes. Here, for demonstration, we are selecting the 20 genes with the highest variance across samples. Treatment status and cell line information are shown with colored bars at the top of the heatmap. Blocks of genes that covary across patients.

Method/Ontology 1

#Method 1 code
topVarGenesGO <- head(order(rowVars(assay(vst)), decreasing = TRUE), 100)
topVarGenesGO
  [1] 23946 23945 24456 15776 24464  1660 10189 24311  1652 24481  6057 24463 22385 15849 24483 10106 10188
 [18] 20848 10186 21054 17462 16399 24462    22 24480 15850 24129  2580 20649 24457 20650  9553 24465 12523
 [35]  4972  4259 10185 22806 10187 10184 15491  3454 20648  9739 21229 18715 17622 21773  3455 22294  7321
 [52] 10181 10908 20645 21227 23012 20653  5159  4477  5001  3052 15389 23982 13827 21651  3453 20454  3720
 [69]  4449 17605 19980 11498 17055  1707 17248  6386  6061  1509 20557 23515 23131  3711  3755 19040 22399
 [86]  6996  5606  2536 18929 11051  6170  1511 10518 18091 15206  2875 20651 14865  7057  2894
mat_100  <- assay(vst)[topVarGenesGO, ]

gostres <- gost(query = rownames(mat_100), 
                organism = "hsapiens", ordered_query = FALSE, 
                multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                measure_underrepresentation = FALSE, evcodes = FALSE, 
                user_threshold = 0.05, correction_method = "g_SCS", 
                domain_scope = "annotated", custom_bg = NULL, 
                numeric_ns = "", sources = NULL, as_short_link = FALSE)

names(gostres)
[1] "result" "meta"  
head(gostres$result, 6)
names(gostres$meta)
[1] "query_metadata"  "result_metadata" "genes_metadata"  "timestamp"       "version"        
plot1 <- gostplot(gostres, capped = TRUE, interactive = TRUE)
plot1
plot <- gostplot(gostres, capped = FALSE, interactive = FALSE)
plot

publish_gosttable(gostres, highlight_terms = gostres$result[c(1),],
                  use_colors = TRUE, 
                  show_columns = c("source", "term_name", "term_size", "intersection_size"),
                  filename = NULL)
The input 'highlight_terms' is a data.frame. The column 'term_id' will be used.

summary of Method/Ontology 1 method :gprofiler2 Ontology : Gene Ontology In this enrichment analysis we Extracted the list of the top 100 differentially expressed genes and run gost function. The function enables to perform functional profiling of gene lists. The function performs statistical enrichment analysis to find over-representation of functions from Gene Ontology The x-axis represents the functional terms that are grouped and color-coded according to data sources and positioned according to the fixed “source_order”. The order is defined in a way that terms that are closer to each other in the source hierarchy are also next to each other in the Manhattan plot The y-axis shows the adjusted p-values in negative log10 scale. Every circle is one term and is sized according to the term size, i.e larger terms have larger circles.

Method/Ontology 2

#Method 2 code
#get differentially expressed genes
dds <- DESeqDataSetFromMatrix(countData = matrix_of_data,colData = coldata,design = ~ dex)
res <- results(DESeq(dds))
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 308 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
#store values
geneList = res[,2]
#create list of genes
names_list = rownames(res)

#lead gene names into symbols variable
symbols <- rownames(res)
#map symbols to IDs
gene_ids <- mapIds(org.Hs.eg.db, symbols, 'ENTREZID', 'SYMBOL')
'select()' returned 1:many mapping between keys and columns
#update ID's in list of gene names
for(i in 1:26364){
  names_list[i] = gene_ids[names_list[i]]
}
names(geneList) = names_list
geneList = sort(geneList, decreasing = TRUE)
gene <- names(geneList)[abs(geneList) > 1.5]
x <- enrichDO(gene          = gene,
              ont           = "DO",
              pvalueCutoff  = 0.05,
              pAdjustMethod = "BH",
              universe      = names(geneList),
              minGSSize     = 5,
              maxGSSize     = 500,
              qvalueCutoff  = 0.05,
              readable      = TRUE)
output_table <- x@result
setDT(output_table)
output_table
y <- gseDO(geneList,
           minGSSize     = 120,
           pvalueCutoff  = 0.2,
           pAdjustMethod = "BH",
           verbose       = FALSE)
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (6.27% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are duplicate gene names, fgsea may produce unexpected results.
output_table <- y@result
setDT(output_table)
output_table

Summary of Method/Ontology 2 Method :clustProfiler Ontology : Disease Ontology

The results of the enrichment provide a table that did not contain any specific indicatation of schizophrenia. When ordering the table based on p-value, some related disorders to schizophrenia include dementia and bipolar disorder which are psychological disorders, however not schizophrenia specifically. There can be many reasons for this lack of findings, however the most possible might be that schizophrenia is not in the database used in the disease ontology for this enrichment.

##### clustProfiler Gene Ontology#####
ego3 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (6.27% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are duplicate gene names, fgsea may produce unexpected results.
Warning in fgseaMultilevel(...) :
  For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.
output_table <- ego3@result
setDT(output_table)
output_table

As expected, the gene sets from the gene ontology are clearly related to schizophrenia, a neurodevelopmental disorder. The first few entries of the table all relate directly with significant differences with genes for the synapse, in integral part of neuron interaction for the to brain function. For example, a decrease in postsynaptic density, listed fourth on the table, dendritic spine density, also listed in the table, has been linked to SZ in postmortem brain tissue [1].

[1] Amber Berdenis van Berlekom, Cita H Muflihah, Gijsje J L J Snijders, Harold D MacGillavry, Jinte Middeldorp, Elly M Hol, René S Kahn, Lot D de Witte, Synapse Pathology in Schizophrenia: A Meta-analysis of Postsynaptic Elements in Postmortem Brain Studies, Schizophrenia Bulletin, Volume 46, Issue 2, March 2020, Pages 374–386, https://doi.org/10.1093/schbul/sbz060

---
title: "R Notebook"
output: html_notebook
---
Names: Leon Kwan, Doron lisiansky, Richard Clarke
github repo: https://github.com/r-clarke/bioinformatics
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119290

```{r}
if (!requireNamespace("knitr", quietly = TRUE))
    install.packages("knitr")
if(!("org.Hs.eg.db" %in% installed.packages()))
  BiocManager::install('org.Hs.eg.db', update = FALSE)
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!("DESeq2" %in% installed.packages())) {
  BiocManager::install("DESeq2", update = FALSE)
}
if (!("EnhancedVolcano" %in% installed.packages())) {
  BiocManager::install("EnhancedVolcano", update = FALSE)
}
if (!("apeglm" %in% installed.packages())) {
  BiocManager::install("apeglm", update = FALSE)
}
if (!("pheatmap" %in% installed.packages())) {
  BiocManager::install("pheatmap", update = FALSE)
}
if (!("gprofiler2" %in% installed.packages())) {
  BiocManager::install("gprofiler2", update = FALSE)
}
if (!("clusterProfiler" %in% installed.packages())) {
  BiocManager::install("clusterProfiler", update = FALSE)
}
if (!("M3C" %in% installed.packages())) {
  BiocManager::install("M3C", update = FALSE)
}
```


```{r message=FALSE, warning=FALSE}
library(M3C)
library(umap)
library(Rtsne)
library(clusterProfiler)
library(gprofiler2)
library ("pheatmap")
library ("RColorBrewer")
library(magrittr)
library(matrixStats)
library(ggplot2)
library(DESeq2)
library(DOSE)
library('org.Hs.eg.db')
library(data.table)
```



```{r}
###### Load the data into R.######
matrix_of_data <- as.matrix(read.table("GSE119290_Readhead_2018_RNAseq_gene_counts.txt"))
coldata<-read.csv("coldata.csv",header = T,row.names=1,stringsAsFactors=T)
```

```{r}
##### calculate per-gene expression ranges and generating a density plot #####
matrix_of_ranges <- rowRanges(matrix_of_data, rows = NULL, cols = NULL, na.rm = FALSE, dim. = dim(matrix_of_data), useNames = NA)

vector_of_ranges <- 0
for(i in 1:26364) {
  vector_of_ranges[i] <- matrix_of_ranges[i,2]-matrix_of_ranges[i,1]
}
vector_of_ranges
plot(density(vector_of_ranges), log='x')
```
The size of the expression matrix generated is 26,364 x 44. There are 26,364 different genes with 44 samples. Looking at the variation of counts, there are a few instances of very large ranges with majority of ranges being smaller.
The density of gene variation follows an expected poisson distribution with lambda 1.There are some genes with high expression ranges, but most genes are not as variable.


```{r}
##### generate a PCA plot #######
dds <- DESeqDataSetFromMatrix(countData = matrix_of_data,colData = coldata,design = ~ dex)
dds <- dds[rowSums(counts(dds)) > 1,]

vst <- vst(dds,blind = FALSE)
plotPCA(vst, intgroup=c("dex"))
```
```{r}
##### generate a UMAP plot #######
matrix_of_data_t <- t(matrix_of_data)
df.umap = umap(matrix_of_data_t)

dimension1 <- range(df.umap$layout)
dimension2 <- dimension1
plot(dimension1, dimension2, type="n")
points(df.umap$layout[,1], df.umap$layout[,2], col=as.integer(coldata[,"dex"])+2, cex=2, pch=20)
labels.u = unique(coldata[,"dex"])
legend.text = as.character(labels.u)
legend.pos = "bottomleft"
legend.text = paste(as.character(labels.u), "")
 
legend(legend.pos, legend=legend.text, inset=0.03,col=as.integer(labels.u)+2,bty="n", pch=20, cex=1)
```

Neither UMAP nor PCA, show clear clustering. However, it should be noted that our dataset may not follow the assumptions necessary for dimensionality reduction for both PCA and UMAP. PCA assumes linearity in the dataset and UMAP assumes uniform distribution on a Reimann manifold.


```{r}
#### Perform differential analysis on the samples from your two groups #####

deseq_object <- DESeq(dds)
deseq_results <- results(deseq_object)
deseq_results <- lfcShrink(deseq_object,coef = 2, res = deseq_results )
head(deseq_results)

deseq_df <- deseq_results %>%
  # make into data.frame
  as.data.frame() %>%
  # the gene names are row names -- let's make them a column for easy display
  tibble::rownames_to_column("Gene") %>%
  # add a column for significance threshold results
  dplyr::mutate(threshold = padj < 0.05) %>%
  # sort by statistic -- the highest values will be genes with
  # higher expression
  dplyr::arrange(dplyr::desc(log2FoldChange))
head(deseq_df)
plotCounts(dds, gene = "DDX11L1", intgroup = "dex")
```
We are performing differential expression analysis. In our DESeq2 object we designated our dex variable which represent the group as the model argument. Because of this, the DESeq function will use groups defined by dex to test for differential expression.
We will use lfcShrink in order to decrease noise and preserve large differences between groups. We can see for example that The DDX11L1 gene samples have similar expression in both groups.

Volcano plot

```{r}
#volcano plot here
volcano_plot <- EnhancedVolcano::EnhancedVolcano(
  deseq_df,lab = deseq_df$Gene,x = "log2FoldChange",
  y = "padj",pCutoff = 0.01 )

volcano_plot
```

Summary of volcano plot

in volcano plot we are presenting analysis between two groups, which presents the log fold-change vs. the log10 p-value.

Heatmap

```{r}
#heatmap
vst <- vst(dds,blind = FALSE)

sampleDists <- dist(t(assay(vst)))
sampleDists


sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( vst$dex, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)

#Heatmap of sample-to-sample distances
pheatmap(sampleDistMatrix,clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,col = colors)

topVarGenes <- head(order(rowVars(assay(vst)), decreasing = TRUE), 20)
topVarGenes
mat  <- assay(vst)[topVarGenes, ]

mat<-mat-rowMeans(mat)
anno <- as.data.frame(colData(vst)[c("dex")])
pheatmap(mat, annotation_col = anno)
```
Summary of heatmap plot
we calculate the distance matrix and apply clustering , 
In the sample distance heatmap made previously, the dendrogram at the side shows us a hierarchical clustering of the samples. Such a clustering can also be performed for the genes. Since the clustering is only relevant for genes that actually carry a signal, one usually would only cluster a subset of the most highly variable genes. Here, for demonstration, we are selecting the 20 genes with the highest variance across samples.
Treatment status and cell line information are shown with colored bars at the top of the heatmap. Blocks of genes that covary across patients.


Method/Ontology 1
```{r}
#Method 1 code
topVarGenesGO <- head(order(rowVars(assay(vst)), decreasing = TRUE), 100)
topVarGenesGO
mat_100  <- assay(vst)[topVarGenesGO, ]

gostres <- gost(query = rownames(mat_100), 
                organism = "hsapiens", ordered_query = FALSE, 
                multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                measure_underrepresentation = FALSE, evcodes = FALSE, 
                user_threshold = 0.05, correction_method = "g_SCS", 
                domain_scope = "annotated", custom_bg = NULL, 
                numeric_ns = "", sources = NULL, as_short_link = FALSE)

names(gostres)
head(gostres$result, 6)
names(gostres$meta)
plot1 <- gostplot(gostres, capped = TRUE, interactive = TRUE)
plot1
plot <- gostplot(gostres, capped = FALSE, interactive = FALSE)
plot
publish_gosttable(gostres, highlight_terms = gostres$result[c(1),],
                  use_colors = TRUE, 
                  show_columns = c("source", "term_name", "term_size", "intersection_size"),
                  filename = NULL)


```

summary of Method/Ontology 1
method :gprofiler2
Ontology : Gene Ontology
In this enrichment analysis we Extracted the list of the top 100 differentially expressed genes and run gost function.
The function enables to perform functional profiling of gene lists. The function performs statistical enrichment analysis to find over-representation of functions from Gene Ontology
The x-axis represents the functional terms that are grouped and color-coded according to data sources and positioned according to the fixed “source_order”. The order is defined in a way that terms that are closer to each other in the source hierarchy are also next to each other in the Manhattan plot
The y-axis shows the adjusted p-values in negative log10 scale. Every circle is one term and is sized according to the term size, i.e larger terms have larger circles.

Method/Ontology 2
```{r}
#Method 2 code
#get differentially expressed genes
dds <- DESeqDataSetFromMatrix(countData = matrix_of_data,colData = coldata,design = ~ dex)
res <- results(DESeq(dds))
#store values
geneList = res[,2]
#create list of genes
names_list = rownames(res)

#lead gene names into symbols variable
symbols <- rownames(res)
#map symbols to IDs
gene_ids <- mapIds(org.Hs.eg.db, symbols, 'ENTREZID', 'SYMBOL')

#update ID's in list of gene names
for(i in 1:26364){
  names_list[i] = gene_ids[names_list[i]]
}
names(geneList) = names_list
geneList = sort(geneList, decreasing = TRUE)
gene <- names(geneList)[abs(geneList) > 1.5]
x <- enrichDO(gene          = gene,
              ont           = "DO",
              pvalueCutoff  = 0.05,
              pAdjustMethod = "BH",
              universe      = names(geneList),
              minGSSize     = 5,
              maxGSSize     = 500,
              qvalueCutoff  = 0.05,
              readable      = TRUE)
output_table <- x@result
setDT(output_table)
output_table
```
```{r}
y <- gseDO(geneList,
           minGSSize     = 120,
           pvalueCutoff  = 0.2,
           pAdjustMethod = "BH",
           verbose       = FALSE)
output_table <- y@result
setDT(output_table)
output_table
```

Summary of Method/Ontology 2
Method :clustProfiler
Ontology : Disease Ontology

The results of the enrichment provide a table that did not contain any specific indicatation of schizophrenia. When ordering the table based on p-value, some related disorders to schizophrenia include dementia and bipolar disorder which are psychological disorders, however not schizophrenia specifically. There can be many reasons for this lack of findings, however the most possible might be that schizophrenia is not in the database used in the disease ontology for this enrichment.


```{r}
##### clustProfiler Gene Ontology#####
ego3 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)
output_table <- ego3@result
setDT(output_table)
output_table
```

As expected, the gene sets from the gene ontology are clearly related to schizophrenia, a neurodevelopmental disorder. The first few entries of the table all relate directly with significant differences with genes for the synapse, in integral part of neuron interaction for the to brain function. For example, a decrease in postsynaptic density, listed fourth on the table, dendritic spine density, also listed in the table, has been linked to SZ in postmortem brain tissue [1].

[1] Amber Berdenis van Berlekom, Cita H Muflihah, Gijsje J L J Snijders, Harold D MacGillavry, Jinte Middeldorp, Elly M Hol, René S Kahn, Lot D de Witte, Synapse Pathology in Schizophrenia: A Meta-analysis of Postsynaptic Elements in Postmortem Brain Studies, Schizophrenia Bulletin, Volume 46, Issue 2, March 2020, Pages 374–386, https://doi.org/10.1093/schbul/sbz060